ABSTRACT
The massive amount of genomic data that has appeared for SARS-CoV-2 over the past two years has challenged traditional methods for studying the dynamics of the COVID-19 pandemic. As a result, new methods such as the Pangolin tool have appeared, which can scale to the millions of SARS-CoV-2 samples currently available. Such tools take as input assembled, aligned and curated full-length sequences, such as those provided by GISAID. As high-throughput sequencing technologies continue to advance, this assembly, alignment and curation may become a bottleneck, creating a need for methods that can process raw sequencing reads directly. In this paper, we propose several alignment-free embedding approaches that generate a fixed-length feature vector representation directly from the raw sequencing reads, without the need for assembly. Moreover, because such an embedding is a numerical representation, it can be passed to highly optimized clustering methods such as k-means. We show, based on several internal clustering evaluation metrics, that the clusterings obtained with the proposed embeddings are better suited to this setting than those of the Pangolin tool. We also show that a disproportionate number of the positions informing these clusterings (in terms of information gain) lie in the spike region of the SARS-CoV-2 genome, which is consistent with current biological knowledge of SARS-CoV-2. © 2022, The Author(s), under exclusive license to Springer Nature Switzerland AG.
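As a rough illustration of what an alignment-free, fixed-length embedding of raw reads can look like, the sketch below builds a normalized k-mer frequency vector from a set of reads. This is an assumption made for illustration only: the abstract does not say which embeddings are proposed, and the names `kmer_embedding` and `ALPHABET`, as well as the choice k = 3, are hypothetical.

```python
from collections import Counter
from itertools import product

ALPHABET = "ACGT"  # assumed nucleotide alphabet; ambiguity codes such as N are skipped


def kmer_embedding(reads, k=3):
    """Return a normalized k-mer frequency vector of length 4**k for a set of reads.

    Illustrative sketch only, not the embedding proposed in the paper.
    """
    # A fixed ordering of all possible k-mers gives every sample the same vector length.
    index = {"".join(p): i for i, p in enumerate(product(ALPHABET, repeat=k))}

    counts = Counter()
    for read in reads:
        read = read.upper()
        # Slide a window of width k over the read; no assembly or alignment is involved.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in index:  # ignore k-mers containing characters outside ALPHABET
                counts[kmer] += 1

    total = sum(counts.values())
    vec = [0.0] * len(index)
    for kmer, c in counts.items():
        vec[index[kmer]] = c / total  # relative frequency; loop is empty if total == 0
    return vec


# Hypothetical toy reads standing in for one sample's raw sequencing output.
sample_reads = ["ACGTACGT", "TTGACNAC"]
vec = kmer_embedding(sample_reads, k=3)
print(len(vec))
```

Because every sample maps to a vector of the same length (4**k), the vectors for many samples could be stacked into a matrix and passed to an off-the-shelf k-means implementation, which is the property the abstract relies on.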